SDA 4.1 Documentation for CORREL
NAME
correl - Correlation coefficients
USAGE
correl -b batchfile
DESCRIPTION
CORREL generates, by default, Pearson correlation coefficients
among pairs of specified variables. Alternatively, the natural
logarithm of the odds ratio can be calculated for each pair. A
weight variable can be used to give different weights to each
case, and filter variables may be used to exclude some of the
cases.
If a case has missing data on ANY of the specified variables, by
default it is excluded from all the calculations. However, there
is an option to exclude cases pairwise -- that is, to calculate
each correlation coefficient using all cases having valid data on
that PAIR of variables.
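The difference between the two missing-data rules can be sketched
in a few lines of Python. This is only an illustration, not
CORREL's own code: the function names, the use of None to mark
missing data, and the sample values are all assumptions made for
the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson r for two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlate(data, a, b, pairwise=False):
    """Return (r, valid N) for variables a and b.

    data maps a variable name to its list of values; None marks
    missing data (an assumption of this sketch).
    """
    # Default rule: a case must be valid on ALL variables.
    # Pairwise rule: a case must be valid on this PAIR only.
    required = [a, b] if pairwise else list(data)
    rows = [i for i in range(len(data[a]))
            if all(data[v][i] is not None for v in required)]
    r = pearson([data[a][i] for i in rows], [data[b][i] for i in rows])
    return r, len(rows)

# Hypothetical data: z is missing for two of the six cases.
data = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 1, 4, 3, 6, 5],
    "z": [1, None, 2, None, 3, 4],
}
r_list, n_list = correlate(data, "x", "y")                 # default rule
r_pair, n_pair = correlate(data, "x", "y", pairwise=True)  # pairwise rule
```

In this made-up dataset the x-y correlation rests on 4 cases under
the default rule, because the two cases with missing z are dropped,
and on all 6 cases under the pairwise rule.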
Ordinarily this program is invoked by the Web interface for the
SDA programs, and the user does not have to deal with the
keywords given in this document. Output from the program is in
HTML, which can be viewed with a Web browser.
It is also possible to run the program directly by preparing a
batch command file, which specifies the variables to be analyzed
and the options to use. This document explains how to prepare
such a file. The name of this batch command file is specified to
the program after the `-b' option flag.
KEYWORDS
The batch file contains specifications for the analysis. These
specifications are given in the form "keyword = something" with
one keyword per line. Keywords may be given in any order, either
in upper or in lower case. The valid keywords are as follows
(with significant characters shown in capital letters):
Keyword        Possible Specification          Default (if no keyword)
_____________________________________________________________________
STUdy=         path(s) of dataset(s)           Look for variables in
                                               current directory only
Vars=          names of vars to correlate      REQUIRED
               (separated by spaces/commas)
Weight=        name of weight variable         No weighting
Filter=        name(s) and codes of filter     No filter
               variable(s)
GVARCase=      LOWER or UPPER                  No force to lower/upper case
MD=            Pairwise                        Cases with any MD are
                                               excluded
SAvefile=      filename to receive output      Output sent to screen
               (overwrite existing file)       (standard output)
TExt=          Yes                             No text for variables
LAnguagefile=  Name of file with non-English   English labels on
               labels and messages             output
RUNtitle=      Title or comments for run       No title or comments
Main Statistic to Display
The main statistic to display in each cell of the matrix can be
one of two options: the Pearson correlation coefficient, or the
log of the odds ratio. The default main statistics to display
are the Pearson correlation coefficients.
For each statistic the user can specify the number of desired
decimal places (in parentheses, after the name of the statistic).
See DECIMAL PLACES below for the default number of decimals for
each statistic. Since the default main statistic is the Pearson
correlation coefficient, it is not necessary to specify that
statistic unless you want to change the number of decimal places
to display.
It is possible to reverse the sign of one or more of the
variables. This may be desirable, for example, in order to have
all of the expected correlations positive. Then a negative
correlation will stand out as being unexpected. (See REVERSING
THE SIGNS OF VARIABLES below for more on this option.) If you
want to reverse the sign of a variable, give its index position
after the 'reverse=' keyword. A variable's index position is its
relative position after the 'vars=' keyword. See the last example
under EXAMPLES OF BATCH FILES below.
Keyword        Possible Specification          Default (if no keyword)
_____________________________________________________________________
MAINstat=      CORR (ndec)                     Display correlations,
               LOGodds (ndec)                  with default number
                                               of decimal places
REVerse=       list                            Do not reverse the signs
               (separated by spaces/commas)    of variables
Other Statistics to Display
In addition to the main statistic, several optional statistics
can be displayed. You can specify the desired number of decimal
places in parentheses if the default numbers of decimals (listed
under DECIMAL PLACES below) are not satisfactory.
- Standard errors of the correlations.
These statistics are placed in a matrix, beneath the matrix of
correlation coefficients. See CALCULATION OF STANDARD ERRORS
below for a note on their calculation.
- Cronbach's alpha coefficient.
This is a function of the average correlation between the
variables.
- Univariate statistics.
The statistics available for each variable include its mean,
standard deviation, standard error, valid N of cases, and (if
there is a weight variable) valid weighted N of cases.
- Paired statistics.
These statistics are available if the 'md=pairwise' option is
specified. The paired statistics displayed are the same as the
univariate statistics, minus the standard errors. Each statistic
is based on the number of valid cases for that pair of variables.
Note that the number of valid cases for various pairs of
variables can be very different from one another.
- P-Square statistics.
For an explanation of the PSQ statistic, see CALCULATION OF
P-SQUARE STATISTICS below.
Keyword        Possible Specification          Default (if no keyword)
_____________________________________________________________________
OTHERstats=
               SECOR (ndec)                    No standard errors of
                                               the correlations
               ALPHA (ndec)                    No alpha coefficient
               (Univariate statistics)
               MEANs (ndec)                    No means
               SD (ndec)                       No standard deviations
               SEVAR (ndec)                    No standard errors
               Ncases                          No unweighted N's
               WNcases (ndec)                  No weighted N's
               (Paired statistics)
               PMEANs (ndec)                   No paired means
               PSD (ndec)                      No paired std devs
               PSEVAR (ndec)                   No paired std errs
               PNcases                         No paired N's
               PWNcases (ndec)                 No paired weighted N's
PSQ=           list1 ; list2 (ndec)            No P-square statistics
                                               (see below)
Note that the 'otherstats=' keyword can be repeated on subsequent
lines if necessary.
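As an illustration of the ALPHA statistic described above, the
following Python sketch computes an alpha coefficient from the
average off-diagonal correlation. This document says only that
alpha "is a function of the average correlation"; the standardized
form used here, k*rbar / (1 + (k-1)*rbar), is an assumption, and
CORREL's exact formula may differ. The function name and the
sample matrix are hypothetical.

```python
def standardized_alpha(corr):
    """Alpha from the mean off-diagonal correlation of a k x k matrix.

    Uses the standardized form k*rbar / (1 + (k-1)*rbar). Whether
    CORREL uses exactly this form is an assumption of this sketch.
    """
    k = len(corr)
    off_diag = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    r_bar = sum(off_diag) / len(off_diag)
    return k * r_bar / (1 + (k - 1) * r_bar)

# Hypothetical matrix: three items that all correlate 0.5.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
alpha = standardized_alpha(corr)  # 3 * 0.5 / (1 + 2 * 0.5) = 0.75
```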
MORE STATISTICAL INFORMATION
DICHOTOMIZING VARIABLES FOR ODDS RATIOS
The calculation of an odds ratio assumes that each of the two
variables in a pair has only two categories. If these statistics
are requested, CORREL treats all of the specified variables as
dichotomies, regardless of the number of categories they may
actually have. The minimum valid value of each variable is
treated as one category, and all valid values greater than the
minimum are combined into the other category. If this default
dichotomization is not appropriate for a particular variable, you
can recode the variable within CORREL by using the standard SDA
temporary recoding syntax.
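The default dichotomization can be sketched in Python as follows.
This is an illustration only: the function names, the use of None
for missing data, the orientation of the 2x2 table, and the sample
values are assumptions, and the sketch makes no attempt to handle
empty cells, for which the odds ratio is undefined.

```python
from math import log

def dichotomize(values):
    """Code the minimum valid value as 0 and all larger valid values as 1."""
    lowest = min(v for v in values if v is not None)
    return [None if v is None else int(v > lowest) for v in values]

def log_odds_ratio(xs, ys):
    """Natural log of the odds ratio for two 0/1 variables."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for x, y in zip(xs, ys):
        if x is not None and y is not None:
            counts[(x, y)] += 1
    # Odds ratio = (n00 * n11) / (n01 * n10).
    # Raises an error if an off-diagonal or diagonal cell is empty.
    ratio = (counts[(0, 0)] * counts[(1, 1)]) / (counts[(0, 1)] * counts[(1, 0)])
    return log(ratio)

# Hypothetical variables: x has four categories, y has two.
x = [1, 1, 1, 1, 2, 3, 4, 2, 3, 1]
y = [1, 1, 1, 2, 2, 2, 2, 1, 2, 2]
result = log_odds_ratio(dichotomize(x), dichotomize(y))
```

Here the values 2, 3, and 4 of x are all combined into a single
category, so the 2x2 table has cells 3, 2, 1, and 4, and the
result is the natural log of (3 * 4) / (2 * 1) = 6.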
CALCULATION OF STANDARD ERRORS
If standard errors are requested, they are computed with the
standard formulas for each statistic or its transformation,
assuming simple random sampling. Note that
the confidence interval for the Pearson correlation coefficient
is not symmetric; therefore, there is no single standard error
that applies in both directions. CORREL outputs the average
distance of the upward and the downward confidence band for one
standard error (based on the retransformation of Fisher's Z),
since that number is ordinarily a useful approximation.
The calculation of the standard error of the correlation
coefficient in each cell is based by default on the UNWEIGHTED
number of cases, even if a weight variable has been used for
calculating the correlation coefficient. Ordinarily this
procedure will generate a more appropriate statistical test than
one based on the weighted N in each cell.
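The averaging of the upward and downward bands can be sketched in
Python as follows. The sketch assumes the usual standard error of
Fisher's Z, 1/sqrt(n - 3), with n taken as the unweighted number
of cases; this document does not spell out the exact formula that
CORREL uses, so treat the function names and details as
assumptions.

```python
from math import atanh, sqrt, tanh

def one_se_band(r, n):
    """Return (lower, upper): r moved one standard error in Fisher's Z.

    Assumes SE(Z) = 1 / sqrt(n - 3), where n is the unweighted N.
    """
    z = atanh(r)                 # Fisher's Z transformation of r
    se_z = 1.0 / sqrt(n - 3)
    return tanh(z - se_z), tanh(z + se_z)  # back-transform to r scale

def se_of_r(r, n):
    """Average distance of the downward and upward one-SE bands."""
    lower, upper = one_se_band(r, n)
    return ((r - lower) + (upper - r)) / 2.0

# Example: for r = 0.8 and n = 28 the band is asymmetric around r;
# the upward distance is smaller than the downward distance.
lower, upper = one_se_band(0.8, 28)
se = se_of_r(0.8, 28)
```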
CALCULATION OF P-SQUARE STATISTICS
The p-square statistic is an index of proportionality for the
rows in a correlation matrix. (The correlation matrix is usually
a matrix of Pearson correlations, although the p-square procedure
will also work with the logs of odds ratios. In such a case,
however, be aware of how the variables are dichotomized.)
If all of the correlation coefficients in one row are exactly
double the size of the coefficients in another row, for example,
there is a constant proportionality, and the index will be 1.0.
Usually this statistic is used to examine the consistency of the
relationships of several items (defining the rows of the matrix)
in respect to a number of criterion variables (defining the
columns of the matrix). For a discussion of the use of this
statistic for creating scales, see Thomas Piazza, "The Analysis
of Attitude Items," American Journal of Sociology,
vol. 86 (1980) pp. 584-603.
The `PSQ=' keyword allows you to specify which items should be
used for the rows (list1), and which items should be used as the
criterion variables (list2). Each list is a set of numbers,
referring to the order in which the variables were specified
after the `Vars=' keyword. Each list can consist of single
numbers or ranges, separated by commas or blanks. The two lists
are separated by a semicolon. An example is given under EXAMPLES
OF BATCH FILES below.
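To make the list syntax concrete, here is a small Python sketch
that expands a PSQ specification into variable names. It is not
CORREL's parser: the function names are hypothetical, and the
sketch assumes that ranges are written without blanks around the
hyphen (as in "1-4") and that an optional "(ndec)" may follow the
second list.

```python
import re

def parse_index_list(text):
    """Expand '1-3, 5 7' into [1, 2, 3, 5, 7] (1-based positions)."""
    indexes = []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue
        if "-" in token:
            first, last = token.split("-")
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(token))
    return indexes

def parse_psq(spec, var_names):
    """Return (row variables, criterion variables) for a PSQ spec."""
    spec = re.sub(r"\(\s*\d+\s*\)\s*$", "", spec)  # drop optional "(ndec)"
    rows_text, criteria_text = spec.split(";")
    rows = [var_names[i - 1] for i in parse_index_list(rows_text)]
    criteria = [var_names[i - 1] for i in parse_index_list(criteria_text)]
    return rows, criteria

# Variables in the order they would follow the vars= keyword.
names = ["spend", "spend2", "spend3", "spend4", "age", "educ", "sex"]
rows, criteria = parse_psq("1-4; 5-7 (3)", names)
```

With this input, the rows are the four 'spend' variables and the
criterion variables are age, educ, and sex.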
DECIMAL PLACES
Each statistic has a default number of decimal places with which
it will be printed. To change the default, put the desired
number of decimals in parentheses after specifying the statistic
(or package of statistics). The default number of decimal places
for the main statistics (correlations and logs of odds ratios) is
2 places. For their standard errors the default is 3 places.
For the alpha statistic, the default is 2 places. The defaults
for the univariate and the paired statistics are: means (2), std
deviations (2), std errors (3), and weighted N's (0). It is not
necessary to request the 'correlation' main statistic unless you
want to change the number of decimal places; unless otherwise
specified, the Pearson correlation coefficient is the statistic
that will be displayed.
ADDITIONAL INFORMATION
ABBREVIATIONS FOR KEYWORDS
Keywords can usually be abbreviated down to the number of
characters required to differentiate them from other keywords.
The keyword for the names of the variables, for instance, can be
given as `variables=' or `vars=' or even `v='. Either upper or
lower case may be used. In the list of keywords given above, the
minimum set of characters for each keyword is capitalized.
Mention of Keyword Sufficient
The form `keyword=yes' may be shortened to `keyword'. That is,
the `=yes' may be omitted for those options which require no
further specification. For example, `text=yes' can be shortened
to `text'.
COMMENTS
Anything on a line beginning with "#" is ignored by the batch
processor and can therefore be used for comments. Blank lines
are also ignored.
REPETITION OF KEYWORDS
If there is not enough room on a line to list all of the desired
variables, the keyword can be repeated on a new line, and more
variables can be listed. In such a case the second list is
appended to the first list.
This appending feature applies to the keywords for specifying the
variables to be correlated, the filter variables, and the
`otherstats=' keyword. It also applies to the 'study=' keyword,
for specifying the locations of the SDA dataset directories. If
other keywords are repeated, the program will print an error
message and stop.
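The batch-file conventions described in this section (comments,
blank lines, bare keywords, and repeated keywords) can be
summarized in a short Python sketch. It is a simplified
illustration, not the program's actual parser: the function name
is hypothetical, keywords must be spelled exactly as in the
APPENDABLE set, and abbreviations are not resolved.

```python
# Keywords whose repeated values are appended (spellings assumed).
APPENDABLE = {"vars", "filter", "otherstats", "study"}

def read_batch(text):
    """Return a dict of keyword -> value for a batch command file."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        keyword, sep, value = line.partition("=")
        keyword = keyword.strip().lower()  # either case is accepted
        # A bare keyword such as 'text' is shorthand for 'text=yes'.
        value = value.strip() if sep else "yes"
        if keyword in settings:
            if keyword not in APPENDABLE:
                raise ValueError(f"keyword repeated: {keyword}")
            settings[keyword] += " " + value  # append the second list
        else:
            settings[keyword] = value
    return settings

sample = """
# Correlate the four spend variables
vars = spend spend2
vars = spend3 spend4
TEXT
savefile = mymatrix.htm
"""
settings = read_batch(sample)
```

Repeating 'vars=' extends the variable list, while repeating a
keyword such as 'savefile=' raises an error, mirroring the rule
stated above.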
REVERSING THE SIGNS OF VARIABLES
It is often useful to reverse the sign of the correlation
coefficients of one or more variables with the other variables.
In a group of attitudinal variables, for example, some variables
might be coded so that a high score means a liberal response,
while other variables might be coded so that a high score means a
conservative response. In the correlation matrix the correlation
coefficients with a variable like 'age' might then be expected to
be positive for the "high = conservative" items and negative for
the "high = liberal" items.
If the correlation matrix has more than a few items, it will be
easier to interpret the correlations if all of the attitudinal
items are scored so that a high score means "conservative" (or
"liberal" -- either way). However, it is not necessary to
actually recode the items to achieve this goal. The CORREL
program allows you to specify one or more items for a reversal of
the signs you would otherwise get with those items. Then a
departure from the expected sign will be easier to detect. For
example, if the "high = liberal" items in the group have their
signs reversed, then ALL of the attitudinal correlations with
'age' might be expected to be positive. So if one or more of the
items have a negative correlation with 'age' it will be more
obvious that the items in question are measuring something
different from what the other items are measuring.
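The effect of a reversal on a correlation matrix can be sketched
in Python. The sketch treats a reversal as equivalent to negating
the variable, so a correlation between two reversed variables
keeps its original sign; that treatment, the function name, and
the sample values are assumptions for illustration and are not
taken from the CORREL source.

```python
def reverse_signs(corr, reverse):
    """Flip the sign of correlations involving the listed variables.

    corr is a square matrix (list of lists); reverse holds 1-based
    index positions, as they would be given after 'reverse='.
    """
    k = len(corr)
    flip = [-1 if (i + 1) in reverse else 1 for i in range(k)]
    # flip[i] * flip[j] is -1 only when exactly one of the pair is
    # reversed; the diagonal (i == j) is therefore left unchanged.
    return [[flip[i] * flip[j] * corr[i][j] for j in range(k)]
            for i in range(k)]

# Hypothetical matrix for the variables: age, item1, item2.
corr = [
    [1.0, 0.3, -0.2],
    [0.3, 1.0, -0.4],
    [-0.2, -0.4, 1.0],
]
# Reverse item2 (position 3) so that a high score has the same
# meaning as a high score on item1.
reversed_corr = reverse_signs(corr, {3})
```

After the reversal, both items correlate positively with age and
with each other, so any remaining negative coefficient would stand
out.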
EXAMPLES OF BATCH FILES
Basic example
    study = /sa/testdata
    vars = spend spend2 spend3 spend4
    savefile = mymatrix.htm
Use weight and filter variables, and request some
univariate statistics and descriptive text for the variables.
    vars = spend spend2 spend3 spend4
    otherstats = means, ncases
    weight= wtvar
    filters= age(18-50) gender(1)
    text = yes
    savefile = mymatrix.htm
Generate a P-square matrix of the four 'spend'
variables, using age, educ, and sex as the criterion variables.
Also request 3 decimal places.
    vars = spend spend2 spend3 spend4 age educ sex
    psq = 1-4; 5-7 (3)
    runtitle= Test run to demonstrate P-square stats
    savefile= mypsq.htm
Reverse the sign of the correlations involving two of
the four 'spend' variables -- the 2nd and 4th mentioned after the
'vars=' keyword.
    vars = spend spend2 spend3 spend4
    reverse = 2 4
    text
    runtitle= Test run to demonstrate reversing signs
    savefile= mytest.htm
CSM, UC Berkeley/ISA
February 11, 2021